Abstract In this paper, we report a multiple sequence alignment of 10 amino acid sequences of the M protein, which come from different coronaviruses (4 SARS-associated and 6 other known coronaviruses). The alignment model is based on the profile HMM (Hidden Markov Model), and model training was carried out with the SAHMM (Self-Adapting Hidden Markov Model) software developed by the authors.
1 Introduction


SARS is the first newly identified serious infectious disease that humanity has faced at the beginning of the 21st century. A variant virus from the coronavirus family has been provisionally recognized as the candidate pathogen of SARS, as reported by the WHO (World Health Organization) on April 29, 2003 (http://www.who.int/csr/sarscountry/en).

Coronaviruses were first isolated from chickens in 1937. There are now approximately 15 species in this family. Coronavirus particles are irregularly shaped, roughly 60-220 nm in diameter, with an outer envelope bearing distinctive ‘club-shaped’ peplomers (about 20 nm long × 10 nm wide at the distal end)[1]. This ‘crown-like’ appearance (Latin, corona) gives the family its name.

The genome of the SARS-associated coronavirus (isolate BJ01) is 29,725 bp long and has 11 ORFs (Open Reading Frames). The whole genome is composed of a stable region encoding an RNA-dependent RNA polymerase (composed of 2 ORFs) and a variable region containing 4 CDSs (Coding Sequences) for the viral structural genes (the S, E, M, and N proteins) and 5 PUPs (Putative Uncharacterized Proteins)[2]. Its gene order is identical to that of other known coronaviruses.

The S (Spike) protein, the N (Nucleocapsid) protein, and perhaps the M protein appear to be the most important candidates for future diagnostic testing, prevention, and treatment based on antibodies and vaccines, as well as for exploring immune reactions[2]. Owing to limited page space, we choose the M protein as an illustrative example here. The M protein, which is involved in transmembrane budding and envelope formation, was predicted to be a mid-sized protein (221 amino acid residues). It is located at nucleotide positions 26379-27044 (isolate BJ01)[2].

For the M protein, both pairwise and multiple sequence alignments have been obtained with the BLAST method and the ClustalW 1.8 software (http://www.ddbj.nig.ac.jp/E-mail/clustalw-e.htm), respectively, and reported in the literature. However, as far as the authors know, a multiple sequence alignment based on the profile HMM has not yet been reported.

In this paper, we report a multiple sequence alignment of 10 amino acid sequences of the M protein, which come from different coronaviruses in the NCBI databases (http://www.ncbi.nlm.nih.gov). They cover 4 SARS-associated coronaviruses isolated from patients in Canada, the USA, and China (Beijing, Hong Kong), and 6 others: 2 from humans (229E, Transmissible gastroenteritis), 3 from domestic animals (Porcine, Bovine, Turkey), and 1 from birds (Avian).


2 Model and Method


The alignment model is based on the profile HMM, whose topology is shown in Fig. 1[3].


Fig.1 Topology of the profile HMM
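
To make the topology concrete, the sketch below enumerates the allowed transitions of a profile HMM with N main states. It assumes the widely used match/insert/delete layout in which every state at position k may move to the insert state at k or to the match and delete states at k + 1; whether SAHMM uses exactly this layout is an assumption, and the function name `profile_hmm_transitions` is hypothetical.

```python
def profile_hmm_transitions(n_main):
    """List the allowed transitions of a profile HMM (assumed standard layout).

    States: 'B' (begin), 'E' (end), match ('M', k) and delete ('D', k)
    for k = 1..n_main, and insert ('I', k) for k = 0..n_main.
    """
    if n_main < 1:
        raise ValueError("n_main must be at least 1")

    def sources(k):
        # States at position k from which a transition can start.
        if k == 0:
            return ["B", ("I", 0)]
        return [("M", k), ("I", k), ("D", k)]

    transitions = []
    for k in range(n_main + 1):
        for src in sources(k):
            # Enter (or stay in) the insert state at the same position.
            transitions.append((src, ("I", k)))
            if k < n_main:
                # Advance to the next main position, emitting or skipping it.
                transitions.append((src, ("M", k + 1)))
                transitions.append((src, ("D", k + 1)))
            else:
                # The last position can only move on to the end state.
                transitions.append((src, "E"))
    return transitions


if __name__ == "__main__":
    for src, dst in profile_hmm_transitions(2):
        print(src, "->", dst)
```

Under this layout the number of transitions grows linearly with N, which is why the number of main states is the natural quantity to tune when the topology is optimized.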


Model training is implemented with the SAHMM software developed by the authors. SAHMM uses a two-stage alternating optimization method to maximize the Bayesian posterior probabilities of the parameters and the topology of a hidden Markov model.

Let $M_N$ denote the profile HMM with $N$ main states, $\lambda_N$ the parameter set of the profile HMM (including the state transition probabilities and the symbol emission probabilities), $O = \{O^{(w)}\}$, $w = 1, 2, \ldots, W$, the training sequence set, and $T_w$ the length of the training sequence $O^{(w)}$.

The first stage of the two-stage alternating optimization method in SAHMM is parameter estimation, that is, finding $\lambda_N^*$ with the number of main states $N$ fixed. By Bayes' formula, we have


$$\lambda_N^* = \arg\max_{\lambda_N} P(\lambda_N \mid O, M_N) = \arg\max_{\lambda_N} P(O \mid \lambda_N, M_N)\, P(\lambda_N \mid M_N) \qquad (1)$$


in which $P(O \mid \lambda_N, M_N)$ is the likelihood function of the training sequence set $O$, and $P(\lambda_N \mid M_N)$ is the prior distribution of the parameter set $\lambda_N$. We use a Bayesian Baum-Welch algorithm combined with simulated annealing to estimate the parameters $\lambda_N$ of the profile HMM. The Baum-Welch algorithm is a special case of the more general EM algorithm. It iterates between an expectation step (E-step) and a maximization step (M-step). The iterative process continues until a stopping rule is satisfied.
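
The E-step/M-step cycle can be sketched for a small discrete HMM as follows. This is a minimal illustration rather than the SAHMM implementation: it handles one observation sequence, uses a generic fully connected HMM instead of the profile topology, omits simulated annealing, and stands in for the Bayesian prior with a simple pseudocount (`pseudo`) added to every expected count. All function names are hypothetical.

```python
import math


def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta, scale, log P(obs)."""
    T, S = len(obs), len(pi)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(S):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(S))
            alpha[t][j] = prev * B[j][obs[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(S)) / scale[t + 1]
    return alpha, beta, scale, sum(math.log(c) for c in scale)


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def baum_welch_step(obs, pi, A, B, pseudo=0.0):
    """One EM iteration: expected counts (E-step), then renormalization (M-step).

    Returns (new_pi, new_A, new_B, log-likelihood under the input parameters).
    """
    T, S, K = len(obs), len(pi), len(B[0])
    alpha, beta, scale, loglik = forward_backward(obs, pi, A, B)
    # gamma[t][i] = P(state i at time t | obs)
    gamma = [[alpha[t][i] * beta[t][i] for i in range(S)] for t in range(T)]
    trans = [[pseudo] * S for _ in range(S)]
    emit = [[pseudo] * K for _ in range(S)]
    for t in range(T):
        for i in range(S):
            emit[i][obs[t]] += gamma[t][i]
            if t + 1 < T:
                for j in range(S):
                    # Expected number of i -> j transitions at time t.
                    trans[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                    * beta[t + 1][j] / scale[t + 1])
    new_pi = normalize([g + pseudo for g in gamma[0]])
    return new_pi, [normalize(r) for r in trans], [normalize(r) for r in emit], loglik


if __name__ == "__main__":
    obs = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
    pi, A, B = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]]
    for _ in range(5):
        pi, A, B, loglik = baum_welch_step(obs, pi, A, B)
        print(round(loglik, 4))
```

With `pseudo=0.0` this is plain maximum-likelihood EM, so the printed log-likelihood should not decrease from one iteration to the next; a simple stopping rule is to halt when the increase falls below a small threshold.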

The second stage of the two-stage alternating optimization method in SAHMM is topology optimization, that is, finding the following $M_N^*$:


$$M_N^* = \arg\max_{N} P(M_N \mid O) = \arg\max_{N} P(O \mid M_N)\, P(M_N) \qquad (2)$$


in which $P(M_N)$ is the prior distribution of the model topology $M_N$. Under the assumption of a non-informative prior distribution, we have


$$P(M_N \mid O) \propto P(O \mid M_N) = \int P(O \mid M_N, \lambda_N)\, P(\lambda_N \mid M_N)\, d\lambda_N \qquad (3)$$


Usually, the integral in Eq. (3) is difficult to calculate directly. Hence we use the Bayesian Information Criterion (BIC)[4] to approximate it:


$$\mathrm{BIC} = -2 \log P(O \mid \lambda_N^*, M_N) + K_N \log W \qquad (4)$$


where $K_N$ is the number of free parameters in the profile HMM with $N$ main states, $W$ is the sample size, and $-\log P(O \mid \lambda_N^*, M_N)$ is the negative of the maximized log-likelihood of the training sequence set $O$. Then the optimum topology model $M_N^*$ is


$$M_N^* = \arg\min_{N} \{ -2 \log P(O \mid \lambda_N^*, M_N) + K_N \log W \} \qquad (5)$$
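
The step from Eq. (3) to Eq. (5) can be written out explicitly. The lines below are a sketch that assumes the usual large-sample (Laplace) approximation behind the BIC, with $\lambda_N^*$ standing for the fitted parameters; the text cites the criterion but does not spell out this derivation.

```latex
\begin{aligned}
\log P(O \mid M_N)
  &= \log \int P(O \mid M_N, \lambda_N)\, P(\lambda_N \mid M_N)\, d\lambda_N \\
  &\approx \log P(O \mid \lambda_N^*, M_N) - \frac{K_N}{2} \log W, \\
-2 \log P(O \mid M_N)
  &\approx -2 \log P(O \mid \lambda_N^*, M_N) + K_N \log W = \mathrm{BIC}.
\end{aligned}
```

Under a non-informative prior on $M_N$, maximizing $P(M_N \mid O)$ is therefore approximately the same as minimizing the BIC over $N$.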


We have proved that $P(O \mid \lambda_N^*, M_N)$ is a monotonically increasing function of $N$, so the objective function in (5) is unimodal (it has a single extremum). Various optimization methods can therefore be used to solve (5), e.g. the golden section method.
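
As an illustration of solving (5), the sketch below computes the BIC of Eq. (4) and searches for the minimizing number of main states with a golden-ratio interval reduction adapted to integers. The log-likelihood and parameter-count functions are toy stand-ins chosen only so that the objective is unimodal; in practice each evaluation would require training a profile HMM with N main states. All names are hypothetical.

```python
import math


def bic(log_likelihood, n_free_params, n_sequences):
    """Eq. (4): BIC = -2 log P(O | lambda*, M_N) + K_N log W."""
    return -2.0 * log_likelihood + n_free_params * math.log(n_sequences)


def golden_section_argmin(f, lo, hi):
    """Return the integer in [lo, hi] minimizing f, assuming f is unimodal there."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # golden ratio conjugate, about 0.618
    cache = {}

    def value(n):
        # Each f(n) would mean training one model, so evaluate it only once.
        if n not in cache:
            cache[n] = f(n)
        return cache[n]

    while hi - lo >= 3:
        width = hi - lo
        a = lo + math.floor((1.0 - inv_phi) * width)
        b = lo + math.ceil(inv_phi * width)
        if value(a) <= value(b):
            hi = b  # the minimum cannot lie to the right of b
        else:
            lo = a  # the minimum cannot lie to the left of a
    return min(range(lo, hi + 1), key=value)


# Hypothetical stand-ins: a log-likelihood that rises with N at a diminishing
# rate, and a parameter count proportional to N. Neither comes from SAHMM.
def toy_log_likelihood(n_main):
    return -4000.0 / n_main - 500.0


def toy_free_params(n_main):
    return 40 * n_main


def toy_bic(n_main, n_sequences=10):
    return bic(toy_log_likelihood(n_main), toy_free_params(n_main), n_sequences)


best_n = golden_section_argmin(toy_bic, 1, 200)
print(best_n)
```

Caching the evaluated points matters here because the search revisits interior points, and each evaluation stands for a full training run.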


3 Data and Results 
3.1 Data 
| Organism | Accession | Length (a.a.) | Web site |
|---|---|---|---|
| SARS coronavirus BJ01 | AY278488.2 | 221 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=30275673&dopt=GenPept |
| SARS coronavirus CUHK-W1 | AY278554.2 | 221 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=30023958&dopt=GenPept |
| SARS coronavirus NC_004718.3 | NC_004718.3 | 221 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=29836504&dopt=GenPept |
| SARS coronavirus Urbani | AY278741.1 | 221 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=30027623&dopt=GenPept |
| Transmissible gastroenteritis virus | NC_002306.2 | 262 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=13399294&dopt=GenPept |
| Human coronavirus 229E | NC_002645 | 225 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=12175752&dopt=GenPept |
| Porcine epidemic diarrhea virus | D49591 | 226 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=1360870&dopt=GenPept |
| Bovine coronavirus | AF220295.1 | 230 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=17529680&dopt=GenPept |
| Turkey coronavirus | JQ1172 | 230 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=77083&dopt=GenPept |
| Avian infectious bronchitis virus | M95169.1 | 225 | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=292958&dopt=GenPept |


3.2 Results 


The multiple sequence alignment of the M protein produced by the ClustalW (1.8) software 


BJ01-a          ------------------------------------MADNGTITVEELKQLLEQWNLVIGFLFLAW
CUHK-b          ------------------------------------MADNGTITVEELKQLLEQWNLVIGFLFLAW
NC-c            ------------------------------------MADNGTITVEELKQLLEQWNLVIGFLFLAW
urbani-d        ------------------------------------MADNGTITVEELKQLLEQWNLVIGFLFLAW
Transmissible-e -MKILLILACVIACACGERYCAMKSDTDLSCRNSTASDCESCFNGGDLIWHLANWNFSWSIILIVF
human-f         --------------------------------------MSNDNCTGDIVTHLKNWNFGWNVILTIF
porcine-g       -------------------------------------MSNGSIPVDEVIEHLRNWNFTWNIILTIL
Bovine-h        ------------------------------MSSVTTPAPVYTWTADEAIKFLKEWNFSLGIILLFI
Turkey-i        ------------------------------MSSVTTPAPVYTWTADEAIKFLKEWNFSLGIILLFI
Avian-j         ----------------------------------MPNETNCTLDFEQSVQLFKEYNLFITAFLLFL


BJ01-a          IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
CUHK-b          IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
NC-c            IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
urbani-d        IMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLA--AVYRIN-WVTGGIAIAMACIVGLMWLS
Transmissible-e ITVLQYGRPQFSWFVYGIKMLIMWLLWPVVLALTIFNAYSEYQVSRYVMFGFSIAGAIVTFVLWIM
human-f         IVILQFGHYKYSRLFYGLKMLVLWLLWPLVLALSIFDTWANWDSN-WAFVAFSFFMAVSTLVMWVM
porcine-g       LVVLQYGHYKYSVFLYGVKMAILWILWPLVLALSLFDAWASFQVN-WVFFAFSILMACITLMLWIM
Bovine-h        TVILQFGYTSRSMFVYVIKMVILWLMWPLTIILTIFN--CVYALN-NVYLGFSIVFTIVALIMWIV
Turkey-i        TIILQFGYTSRSMSVYVIKMIILWLMWPLTIILTIFN--CVYALN-NVYLGFSIVFTIVAIIMWIV
Avian-j         TIILQYGYATRSKVIYTLKMIVLWCFWPLNIAVGVIS--CTYPPN-TGGLVAAIILTVFACLSFVG


BJ01-a          YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHSLGR-
CUHK-b          YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHSLGR-
NC-c            YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHSLGR-
urbani-d        YFVASFRLFARTRSMWSFNPETNILLNVPLR-GTIVTRPLMESELVIGAVIIRGHLRMAGHPLGR-
Transmissible-e YFVRSIQLYRRTKSWWSFNPETKAILCVSAL-GRSYVLPLEGVPTGVTLTLLSGNLYAEGFKIAGG
human-f         YFANSFRLFRRARTFWAWNPEVNAITVTTVL-GQTYYQPIQQAPTGITVTLLSGVLYVDGHRLASG
porcine-g       YFVNSIRLWRRTHSWWSFNPETDALLTTSVM-GRQVCIPVLGAPTGVTLTLLSGTLLVEGYKVATG
Bovine-h        YFVNSIRLFIRTGSWWSFNPETNNLMCIDMK-GRMYVRPIIEDYHTLTVTIIRGHLYMQGIKLGTG
Turkey-i        YFVNSIRLFIRTGSWWSFNPETNNLMCIDMK-GRMYVRPIIEDYHTLTVTIIRGHLYMQGIKLGTG
Avian-j         YWIQSIRLFKRCRSWWSFNPESNAVGSILLTNGQQCNFAIESVPMVLSPIIKNGVLYCEGQWLAK-


BJ01-a          CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
CUHK-b          CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
NC-c            CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
urbani-d        CDIKDLPKEITVATSR-TLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHAGSNDNIALLVQ--
Transmissible-e MNIDNLPKYVMVALPSRTIVYTLVGKKLKASSATGWAYYVKSKAGDYSTEAR-TDNLSEQEKLLHMV
human-f         VQVHNLPEYMTVAVPSTTIIYSRVGRSVNSQNSTGWVFYVRVKHGDFSAVSSPMSNMTENERLLHFF
porcine-g       VQVSQLPNFVTVAKATTTIVYGRVGRSVNASSGTGWAFYVRSKHGDYSAVSNPSAVLTDSEKVLHLV
Bovine-h        YSLSDLPAYVTVAKVS-HLLTYKRGFLDKIGDTSGFAVYVKSKVGNYRLPSTQKGSGLDTALLRNNI
Turkey-i        YSLSDLPAYVIVAKVS-HLLTYKRGFLDKIGDTSGFAVYVKSKVGNYRLPSTQKGSGMDTALLRNNI
Avian-j         CEPDHLPKDIFVCTPDRRNIYRMVQKYTGDQSGNKKRFATFVYAKQSVDTGELESVATGGSSLYT--

The multiple sequence alignment of the M protein produced by the SAHMM software


BJ01-a          MADNGTI--- ---------T VE-E-LKQLL EQWN------ ---LVI-GFL FLAWI-----
CUHK-b          MADNGTI--- ---------T VE-E-LKQLL EQWN------ ---LVI-GFL FLAWI-----
NC-c            MADNGTI--- ---------T VE-E-LKQLL EQWN------ ---LVI-GFL FLAWI-----
urbani-d        MADNGTI--- ---------T VE-E-LKQLL EQWN------ ---LVI-GFL FLAWI-----
Transmissible-e MKILLILACV IACACGERYC AM-K-SDTDL SCRNSTASDC ESCFNG-GDL IWHLANWNFS
human-f         M-SNDNC--- ---------T GD-I--VTHL KNWNF----- ---GWN-VIL TIFIV-----
porcine-g       M-SNGSI--- ---------P VD-E-VIEHL RNWNF----- ---TWN-IIL TILLV-----
Bovine-h        MSSVTTPAPV YTW------T AD-E-AIKFL KEWNFS---- ---LGI--IL LFITV-----
Turkey-i        MSSVTTPAPV YTW------T AD-E-AIKFL KEWNFS---- ---LGI--IL LFITI-----
Avian-j         MPNETNC--- ---------T LDFEQSVQLF KEYN------ ---LFITAFL LFLTI-----


BJ01-a          ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
CUHK-b          ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
NC-c            ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
urbani-d        ---------- MLLQFAYSNR NRFLYIIKLV FLWLLWPVTL A-CFVLA-AV YRI-NWVTGG
Transmissible-e WSIILIVFIT VL-QYGRPQF SWFVYGIKML IMWLLWPVVL ALTIFNAYSE YQVSRYVMFG
human-f         ---------- IL-QFGHYKY SRLFYGLKML VLWLLWPLVL ALSIFDTWAN WDS-NWAFVA
porcine-g       ---------- VL-QYGHYKY SVFLYGVKMA ILWILWPLVL ALSLFDAWAS FQV-NWVFFA
Bovine-h        ---------- IL-QFGYTSR SMFVYVIKMV ILWLMWPLTI ILTIFNC-VY ALN-NVYLGF
Turkey-i        ---------- IL-QFGYTSR SMSVYVIKMI ILWLMWPLTI ILTIFNC-VY ALN-NVYLGF
Avian-j         ---------- IL-QYGYATR SKVIYTLKMI VLWCFWPLNI A-VGVIS-CT YPP-N--TGG


BJ01-a          -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
CUHK-b          -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
NC-c            -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
urbani-d        -IAIAMACIV G--LMWLSYF VASFRLFART RSMWSFNPET NILLNVPL-R GTIVTRPLME
Transmissible-e -FSIAGAIVT F--VLWIMYF VRSIQLYRRT KSWWSFNPET KAILCVSALG RSYV-LPLEG
human-f         -FSFFMAVST L--VMWVMYF ANSFRLFRRA RTFWAWNPEV NAITVTTVLG QTYY-QPIQQ
porcine-g       -FSILMACIT L--MLWIMYF VNSIRLWRRT HSWWSFNPET DALLTTSV-M GRQVCIPVLG
Bovine-h        SIVFTIVAII ----MWIVYF VNSIRLFIRT GSWWSFNPET NNLMCIDMKG RMYV-RPIIE
Turkey-i        SIVFTIVAII ----MWIVYF VNSIRLFIRT GSWWSFNPET NNLMCIDMKG RMYV-RPIIE
Avian-j         -LVAALILTV FACLSFVGYW IQSIRLFKRC RSWWSFNPES NAVGSILLTN GQQC-NFAIE


BJ01-a          S-ELVIGAVI IRGHLRMAGH -SLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
CUHK-b          S-ELVIGAVI IRGHLRMAGH -SLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
NC-c            S-ELVIGAVI IRGHLRMAGH -SLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
urbani-d        S-ELVIGAVI IRGHLRMAGH -PLGRCDIKD LPKEITVA-T SRTLSYYKLG A--SQRVGTD
Transmissible-e V-PTGVTLTL LSGNLYAEGF KIAGGMNIDN LPKYVMVALP SRTIVYTLVG K--KLKASSA
human-f         A-PTGITVTL LSGVLYVDGH RLASGVQVHN LPEYMTVAVP STTIIYSRVG R--SVNSQNS
porcine-g       A-PTGVTLTL LSGTLLVEGY KVATGVQVSQ LPNFVTVAKA TTTIVYGRVG R--SVNASSG
Bovine-h        D-YHTLTVTI IRGHLYMQGI KLGTGYSLSD LPAYVTVAKV SHLLTY-KRG F--LDKIGDT
Turkey-i        D-YHTLTVTI IRGHLYMQGI KLGTGYSLSD LPAYVTVAKV SHLLTY-KRG F--LDKIGDT
Avian-j         SVPMVLSPII KNGVLYCEGQ -WLAKCEPDH LPKDIFVCTP DRRNIYRMVQ KYTGDQSGNK


BJ01-a          SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
CUHK-b          SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
NC-c            SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
urbani-d        SGFAAYNRYR ---IGNYKLN TD-HAGSNDN IALL--VQ
Transmissible-e TGWAYY-VKS K--AGDYSTE AR-TDNLSEQ EKLLH-MV
human-f         TGWVFY-VRV K--HGDFSAV SSPMSNMTEN ERLLH-FF
porcine-g       TGWAFY-VRS
Bovine-h        SGFAVY-VKS
Turkey-i        SGFAVY-VKS
Avian-j         KRFATF-VYA